Poznámky

##Návod na přidání nové proměnné do analýzy

  • Ve snipetu Loading Jedlička přidat do select nebo do mutate
  • Ve snipetu “Loading Baker” přidat čésůp číslo sloupceš (číslo sloupce v Excelu + 1) a o řádek níž upravit první argument funnkce prejemnuj a v druhám přídat název sloupce.

Úvod a příprava dat

Reprodukujeme následující článek a grafy v něm: (https://www.nature.com/articles/533452a)[https://www.nature.com/articles/533452a]

Načtení dat

Data byla pipravená pomocí skriptu preprocess.Rmd, z něj si lze vybrat dva soubory, jeden by měl být hůře připravený, ale s úplně všemi proměnnými, na které jsme se ptali (dataAll.RData) a druhý by měl být selektovaný, ale lépe připravený (dataProcessed.RData). Zkusím je nedřív porovnat.

load("dataAll.RData")
complete = data

load("dataProcessed.RData")
selected = data

ncol(complete) == ncol(selected)
## [1] FALSE
ncol(complete) ; ncol(selected)
## [1] 161
## [1] 69

Fooo-hooo! Tak z připravených dat vypadlo skoro 100 proměnných!!! Jasně, už si to vybavuju! Celé bloky na faktory neobjektivity šly pryč, protože to nebylo primární v dosavatních textech a zprávách. Tak asi začnu tím, že se kouknu, co v Nature dělali a jestli to v obou souborech.

Hledání proměnných pro komparaci s Baker in Nature

tibble(
  `Baker Graph` = c("Is there reproducibility crisis?",
            "HOW MUCH PUBLISHED WORK IN YOUR FIELD IS REPRODUCIBLE?",
            "Obory (Chemistry, Phycs & Engineering, Earth & Environment, Biology, Medicine, Other)",
            "HAVE YOU FAILED TO REPRODUCE AN EXPERIMENT? (Own/Someone's else)",
            "WHAT FACTORS CONTRIBUTE TO IRREPRODUCIBLE RESEARCH?",
            "HAVE YOU EVER TRIED TO PUBLISH A REPRODUCTION ATTEMPT? (Successful/Unsuccessful)",
            "HAVE YOU ESTABLISHED PROCEDURES FOR REPRODUCIBILITY?",
            "---",
            "---"),
  `Baker Var` = c("(20) Which of the following statement regarding a 'crisis or reproducibility' within the science community do you agree with?",
               "(22) In your opinion, what proportion of published results in your field are reproducible? i.e. the results of a given study could be replicated exactly or reproduced in multiple similar experimental systems with variations of experimental settings such as materials and experimental model)",
               "(91) Which of the following best describes your area of interest?",
               "(79) Which, if any, of the following have you done?, (80) ...80",
               "(52--65) Please use the scale below to indicate how much each of the following factors contributes to a failure to reproduce results:, ...53 -- ...65",
               "(81--84) ...81, ...82, ...83, ...84 ",
               "(40) Have you and/or your lab group established any procedures to ensure reproducibility in your work?, (42) When did you and/or your lab group establish these procedures?",
               "(85) Has anyone ever told you that they could not reproduce results from one of your own experiments?",
               "(109) In which continent do you live?, (110--115) Which country in ..."),
  `My Var: complete` = c("krize_rep",
               "rep_podil", 
               "spec_hlavni",
               "nerep_jaMuj, nerep_jaCizi",
               "fakt_podvod -- fakt_smula",
               "---", 
               "---",
               "nerep_oniMuj",
               "---"),
  `My Var: selected` = c("krize_rep",
               "---",
               "spec_hlavni",
               "---",
               "---",
               "---",
               "---",
               "---",
               "---")
) %>% kable()
Baker Graph Baker Var My Var: complete My Var: selected
Is there reproducibility crisis? (20) Which of the following statement regarding a ‘crisis or reproducibility’ within the science community do you agree with? krize_rep krize_rep
HOW MUCH PUBLISHED WORK IN YOUR FIELD IS REPRODUCIBLE? (22) In your opinion, what proportion of published results in your field are reproducible? i.e. the results of a given study could be replicated exactly or reproduced in multiple similar experimental systems with variations of experimental settings such as materials and experimental model) rep_podil
Obory (Chemistry, Phycs & Engineering, Earth & Environment, Biology, Medicine, Other) (91) Which of the following best describes your area of interest? spec_hlavni spec_hlavni
HAVE YOU FAILED TO REPRODUCE AN EXPERIMENT? (Own/Someone’s else) (79) Which, if any, of the following have you done?, (80) …80 nerep_jaMuj, nerep_jaCizi
WHAT FACTORS CONTRIBUTE TO IRREPRODUCIBLE RESEARCH? (52–65) Please use the scale below to indicate how much each of the following factors contributes to a failure to reproduce results:, …53 – …65 fakt_podvod – fakt_smula
HAVE YOU EVER TRIED TO PUBLISH A REPRODUCTION ATTEMPT? (Successful/Unsuccessful) (81–84) …81, …82, …83, …84
HAVE YOU ESTABLISHED PROCEDURES FOR REPRODUCIBILITY? (40) Have you and/or your lab group established any procedures to ensure reproducibility in your work?, (42) When did you and/or your lab group establish these procedures?
(85) Has anyone ever told you that they could not reproduce results from one of your own experiments? nerep_oniMuj
(109) In which continent do you live?, (110–115) Which country in …

Příprava a spojování dat

Teď jsme si udělali jasno ohledně možné komparace dat z obou výzkumů. Je jasné, že je třeba vzít dataAll.RData, neboť v těch selektovaných datech nic pořádně na srovnání s Baker není. Teď si tedy podle tabulky výše upravím objekt complete a z něj vyberu vše, co je komparovatelné.

my = filter(complete, cas_sec >= 600, zakl_vyzkum == "Ano") %>% 
  select(krize_rep, rep_podil, spec_hlavni, nerep_jaMuj, nerep_oniMuj, 
         nerep_jaCizi, fakt_podvod:fakt_smula, 
         starts_with("reg_"), starts_with("nerep_"),  starts_with("desp"),  starts_with("reakce_"),
         frek_nerepre:neobj_salam, gender:kar_kategorie) %>%  
  mutate(
    Author = "Jedlička",
    continent = "Europe",
    Country = "Czech Republic",
    across(
      c(krize_rep, spec_hlavni:fakt_smula, despekt, nerep_problem:nerep_narust),
      ~recode(
        .x,
        Nevím = "I don't know", Ano = "Yes", Ne = "No", `Nepamatuji se` = "I can't remember" ,
        `Nedělám experimenty` = "I don't do experiments",
        `Ano, významná krize` = "Yes, significant crisis",
        `Ano, nevýznamná krize` = "Yes, slight crisis",
        `Ne, žádná krize není` = "No crisis",
        # `Astronomy and planetary science` = "Astronomy and Planetary Science",
        `Mathematics` = "Other specialization",
        Vždy = "Always", Nikdy = "Never", 
        Zřídka = "Rarely", Někdy = "Sometimes",
        `Velmi často` = "Very often",
        `Rozhodně souhlasím` = "Strongly agree", `Spíše souhlasím` = "Agree",
        `Ani souhlasím/ani nesouhlasím` = "Neither agree nor disagree",
        `Spíše nesouhlasím` = "Disagree", `Rozhodně nesouhlasím` = "Strongly disagree"
      )
    ),
    across(
      starts_with("neobj_"),
      ~recode(.x, Ano = "100", Ne = "0") %>% parse_number()
    ),
    across(
      starts_with("frek_"),
      ~parse_number(.x)
    )
  )
## Warning: There were 16 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `across(...)`.
## Caused by warning:
## ! 71 parsing failures.
## row col expected actual
##  60  -- a number  Nevím
##  62  -- a number  Nevím
##  84  -- a number  Nevím
## 119  -- a number  Nevím
## 127  -- a number  Nevím
## ... ... ........ ......
## See problems(...) for more details.
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 15 remaining warnings.

MŠ: Warningy, které hlásí, se týkají případů, kdy v proměnných freq_xxx ne vyplněno Nevím, které překoduje do NA.

Aha… Tak když jsme dali limit 10 minut, tak jsme přišli o 20 lidí, vzorek klesne na 1001, když to dáme na 12 minut, tak to klesne až na 967, to je dalších 33 lidí, tedy přijdeme o 53 lidí, tedy 5% vzorku… No, já to teď nechám na těch 12 minutách, ale stačí říct, že by se těch dalších 33 lidí hodilo a já to celé sjedu s limitem 600 sekund. Ápropós, když limit snížím na 15 minut, tak vzorem klesne až na 878 respondentů.

tibble(`Časový limit (minuty)` = c(0, 10, 12, 15), N = c(1021, 1001, 967, 878)) %>% kable()
Časový limit (minuty) N
0 1021
10 1001
12 967
15 878

Teď musíme provést totéž s daty od Baker. Načteme je, vybereme z nich proměnné a nakonec je budeme rekódovat tak, aby data byla srovnatelná.

MŠ: přidal jsem Country, která slučuje cntr.x

# Definice vlastní funkce pro lepší pejmenování
prejmenuj = function(tdf, pozice, jmena) {
  if (length(jmena) < 1) stop(print('Musíte zadat nějaká jména!'))
  if (length(pozice) < 1) stop(print('Musíte zadat nějaké pozice!'))
  if (length(pozice) !=  length(jmena)) stop(print('délka vektoru nových jmen a délka vektoru pozic se musí shodovat!'))
  for (i in 1:length(jmena)) {
    names(tdf)[pozice[i]] = jmena[i]
  }
  tdf
}


Baker = read_xlsx("../data_and_R_src/dataNature.xlsx", skip = 1) %>% 
  select(c(20, 22, 91, 79:80, 52:65, 85, 89, 24, 25, 50, 51, 109:115)) %>% 
  prejmenuj(1:32, 
            c("krize_rep", "rep_podil", "spec_hlavni", "nerep_jaMuj", "nerep_jaCizi",
              'fakt_podvod', 'fakt_karier', 'fakt_dohled', 'fakt_recenz', 'fakt_selekc',
              'fakt_replik', 'fakt_statis', 'fakt_odborn', 'fakt_data', 'fakt_dokume',
              'fakt_metody', 'fakt_variab', 'fakt_design', 'fakt_smula', 
              "nerep_oniMuj", "educ",
              "nerep_problem", "nerep_problemAll", "nerep_chyba", "nerep_neobj",   
               "continent", "cntr.1", "cntr.2", "cntr.3", "cntr.4", "cntr.5", "cntr.6"
              )) %>% 
  mutate(
      across(
      c(krize_rep:nerep_oniMuj),
      ~recode(
        .x,
        # `I don't know` = `Nevím`, Yes = "Ano", No = "Ne", `I can't remember` = "Nepamatuji se",
        `There is a significant crisis of reproducibility` = "Yes, significant crisis",
        `There is a slight crisis of reproducibility` = "Yes, slight crisis",
        `There is no crisis of reproducibility` = "No crisis",
        `Astronomy and planetary science` = "Astronomy and Planetary Science",
        `Other` = "Other specialization",
        `Always contributes` = "Always", `Never contributes` = "Never",
        `Rarely contributes` = "Rarely", `Sometimes contributes` = "Sometimes",
        `Very often contributes` = "Very often"
      )
    )
  )
## Warning: Expecting logical in CZ1141 / R1141C104: got 'any'
## New names:
## • `` -> `...1`
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `No` -> `No...9`
## • `` -> `...11`
## • `` -> `...12`
## • `No` -> `No...19`
## • `` -> `...20`
## • `` -> `...21`
## • `` -> `...22`
## • `` -> `...23`
## • `` -> `...26`
## • `` -> `...30`
## • `` -> `...34`
## • `` -> `...35`
## • `` -> `...39`
## • `` -> `...40`
## • `` -> `...41`
## • `` -> `...42`
## • `` -> `...43`
## • `` -> `...44`
## • `` -> `...45`
## • `` -> `...46`
## • `` -> `...47`
## • `` -> `...48`
## • `` -> `...49`
## • `` -> `...66`
## • `` -> `...78`
## • `` -> `...85`
## • `` -> `...86`
## • `` -> `...87`
## • `` -> `...88`
## • `` -> `...89`
## • `` -> `...90`
## • `` -> `...91`
## • `` -> `...92`
## • `` -> `...93`
## • `` -> `...94`
## • `` -> `...95`
## • `` -> `...96`
## • `` -> `...97`
## • `` -> `...98`
## • `` -> `...99`
## • `` -> `...100`
## • `` -> `...101`
## • `` -> `...102`
## • `` -> `...103`
## • `` -> `...104`
## • `` -> `...105`
## • `` -> `...106`
## • `` -> `...107`
## • `` -> `...108`
## • `` -> `...109`
## • `` -> `...110`
## • `` -> `...111`
## • `` -> `...112`
## • `` -> `...113`
## • `` -> `...114`
## • `` -> `...115`
  Baker <- Baker %>% 
    rowwise() %>% 
     mutate( Author = "Baker",
       Country = paste0(na.omit(c_across(starts_with("cntr."))), collapse = ", ")
 )

Tak proměnné jsou vybrané a základní příprava hotová, teď můžeme soubory spojit, aby je bylo možné porovnat, dočištění, které lze udělat společně, to uděláme společně teď, tedy hned po spojení:

df = add_rows(my, Baker) %>%  # Takto spojíme soubory
  mutate(
    rep_podil = parse_number(rep_podil),
    across(
      c(krize_rep, spec_hlavni:fakt_smula, despekt, Author), 
      ~factor(
        .x, 
        levels = 
          c("Yes, significant crisis", "Yes, slight crisis", "No crisis", "Yes", "No",
            "I can't remember", "I don't do experiments", "Astronomy and Planetary Science", "Biology", "Chemistry",
            "Earth and Environmental Science", "Engineering", "Materials Science",  # "Mathematics",
            "Medicine", "Physics", "Other specialization", "Baker", "Jedlička",
            "Always", "Very often", "Sometimes", "Rarely", "Never", "Strongly agree", "Agree",
            "Neither agree nor disagree", "Disagree", "Strongly disagree", "I don't know"
            )
        )
     )
  )
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `rep_podil = parse_number(rep_podil)`.
## Caused by warning:
## ! 82 parsing failures.
## row col expected actual
##  46  -- a number  Nevím
##  49  -- a number  Nevím
##  59  -- a number  Nevím
##  79  -- a number  Nevím
##  85  -- a number  Nevím
## ... ... ........ ......
## See problems(...) for more details.

MŠ: Warningy viz výše

Grafy

V téhle části jednak zreplikuju grafy Monyi Baker z Nature a rovnou je srovnám s našimi výsledky. Grafy vezmu popořadě.

Is there reproducibility crisis?

dfs = df %>% select(krize_rep, Author) %>% filter(!is.na(krize_rep)) %>%  
  group_by(Author) %>% mutate(N = n()) %>% 
  group_by(Author, krize_rep, N) %>% summarise(n = n()) %>% 
  mutate(f = round(n / N * 100, 1))
## `summarise()` has grouped output by 'Author', 'krize_rep'. You can override
## using the `.groups` argument.
ggplot(dfs, aes(y = fct_rev(krize_rep), x = f, fill = Author, label = paste(f, "%"))) +
  geom_col(alpha = 0.75, na.rm = T, position = position_dodge2(rev = T)) +
  geom_text(position = position_dodge2(width = 0.9, reverse = T)) +
  labs(title = "Is there reproducibility crisis?", x = "%", 
       caption = paste0("N_Baker = ", dfs[1, "N"], 
                        ", N_Jedlička = ", dfs[8, "N"])) +
  # scale_x_log10() +
  # guides(fill = "none") +
  theme_minimal()

Tak tady jsem si všiml, a není to chyba!, že v našem výzkumu chybí 120 pozorování, ti lidi prostě neodpověděli, jiné vysvětlení nemám. My tedy máme 901 platných odpovědí, 120 chybělo, Baker má těch 1576 (obecně myslím, že tam má jenom ty, co odpověděli na všechno, ale teď si tím nejsem jistý).

Jinak, překvapení se nekoná, graf z Nature sedí, čísla jsme reprodukovali.

MŠ, přesněji, stejně jsme to spočítali z jejích dat. Nicméně bych neřekl, že jsme reprodkuovali její výzkum v Čecháh. I pokud budeme zkoumat jen ty, co vědí, ale přiděláme CI, vidíme, že se se ve dvou případech neprotínají.

dfsknowing <- df %>% 
  filter(krize_rep != "I don't know") %>%
  select(krize_rep, Author) %>%
  filter(!is.na(krize_rep)) %>%
  group_by(Author) %>%
  mutate(N = n()) %>%
  group_by(Author, krize_rep, N) %>%
  summarise(n = n(), .groups = "drop") %>%
  mutate(f = round(n / N * 100, 1)) %>%
  mutate(se = sqrt((f * (100 - f)) / N),  # Standard error
         lower = f - 1.96 * se,           # 95% CI Lower Bound
         upper = f + 1.96 * se)           # 95% CI Upper Bound

  
  
dodge_width <- 0.9  # Adjust dodging width for correct spacing

ggplot(dfsknowing, aes(y = fct_rev(krize_rep), x = f, fill = Author, label = paste(f, "%"))) +
  geom_col(alpha = 0.75, na.rm = TRUE, position = position_dodge(width = dodge_width), 
           width = 0.4) +  # Reduce bar width to avoid overlap
  geom_errorbar(aes(xmin = lower, xmax = upper), 
                width = 0.2, 
                position = position_dodge(width = dodge_width)) + 
  geom_text(position = position_dodge(width = dodge_width), hjust = -0.5) +
  labs(title = "Is there a reproducibility crisis?", x = "%", 
       caption = paste0("N_Baker = ", dfsknowing[1, "N"], 
                        ", N_Jedlička = ", dfsknowing[8, "N"])) +
  theme_minimal()

A když uděláme exaktní statistickej test toho, že mezi těmi, co vědí, jsou poměry stejné u nás a u bakerové, vyjde supersignifikantně (že ne).

# Ensure the dataframe is ungrouped
reduced_dfs <- dfsknowing %>%
  select(Author, krize_rep, n) %>%
  ungroup() %>%  # REMOVE GROUPING
  mutate(
    Author = as.character(Author),     # Convert factors to characters
    krize_rep = as.character(krize_rep)
  )


# Convert to contingency table
data_for_chisq <- xtabs(n ~ Author + krize_rep, data = reduced_dfs)



# Perform Chi-Square Test
chi_result <- chisq.test(data_for_chisq)

data_df <- as.data.frame(as.table(data_for_chisq))

# Create the mosaic plot
ggplot(data_df) +
  geom_mosaic(aes(weight = Freq, x = product(Author), fill = krize_rep)) +
  labs(title = "Mosaic Plot of Contingency Table",
       x = "Author",
       y = "krize_rep",
       fill = "krize_rep") +
  theme_minimal()
## Warning: The `scale_name` argument of `continuous_scale()` is deprecated as of ggplot2
## 3.5.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The `trans` argument of `continuous_scale()` is deprecated as of ggplot2 3.5.0.
## ℹ Please use the `transform` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `unite_()` was deprecated in tidyr 1.2.0.
## ℹ Please use `unite()` instead.
## ℹ The deprecated feature was likely used in the ggmosaic package.
##   Please report the issue at <https://github.com/haleyjeppson/ggmosaic>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# Print results
print(chi_result)
## 
##  Pearson's Chi-squared test
## 
## data:  data_for_chisq
## X-squared = 106.53, df = 2, p-value < 2.2e-16

Takže hypotéza Náhodně vybraný vědecký týpek to má tak a tak v Česku neplatí. Tak ted zkoumejme, zda to vyjde aspoň po zohlednění kontinentu, a rozdělíme Evropu na východní a západní.

# Read the CSV file containing East and West Europe classification
europe_regions <- read.csv("east_and_west_europe.csv")

# Assume `df` is your main data frame with a "Continent" column
# Merge with `df` based on country
df <- df %>%
  left_join(europe_regions, by = "Country") %>%  # Adjust "Country" if needed
  mutate(Subcontinent = ifelse(continent == "Europe", Region, continent)) %>%
  select(-Region)  # Remove the Region column if no longer needed

# Print updated data frame


dfsknowingcontinent <- df %>% 
  filter(krize_rep != "I don't know") %>%
  select(krize_rep, Author, Subcontinent) %>%
  filter(!is.na(krize_rep)) %>%
  group_by(Author, Subcontinent) %>%
  mutate(N = n()) %>%
  group_by(Author, Subcontinent, krize_rep, N) %>%
  summarise(n = n(), .groups = "drop") %>%
  mutate(f = round(n / N * 100, 1)) %>%
  mutate(se = sqrt((f * (100 - f)) / N),  # Standard error
         lower = f - 1.96 * se,           # 95% CI Lower Bound
         upper = f + 1.96 * se)  %>%
  mutate(Author_Cont = paste0(Author,substr(Subcontinent,1,10)))


  
# Ensure the dataframe is ungrouped
reduced_dfs <- dfsknowingcontinent %>%
  select(Author_Cont, krize_rep, n) %>%
  ungroup() %>%  # REMOVE GROUPING
  mutate(
    Author_Cont = as.character(Author_Cont),     # Convert factors to characters
    krize_rep = as.character(krize_rep)
  )


# Convert to contingency table
data_for_chisq <- xtabs(n ~ Author_Cont + krize_rep, data = reduced_dfs)

data_df <- as.data.frame(as.table(data_for_chisq))

# Create the mosaic plot
ggplot(data_df) +
  geom_mosaic(aes(weight = Freq, x = product(Author_Cont), fill = krize_rep)) +
  labs(title = "Mosaic Plot of Contingency Table",
       x = "Author_Cont",
       y = "krize_rep",
       fill = "krize_rep") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),  # Rotate X labels
    legend.position = "none"  # Remove legend
  )

A když udělám chi-sq test pro naše a Bakerové východně evropské respondenty

# Ensure the dataframe is ungrouped
reduced_dfs_ac <- dfsknowingcontinent %>%
  filter(Subcontinent == "East Europe") %>%
  select(Author, krize_rep, n) %>%
  ungroup() %>%  # REMOVE GROUPING
  mutate(
    Author = as.character(Author),     # Convert factors to characters
    krize_rep = as.character(krize_rep)
  )


# Convert to contingency table
data_for_chisq_ac <- xtabs(n ~ Author + krize_rep, data = reduced_dfs_ac)

# Perform Chi-Square Test
chi_result <- chisq.test(data_for_chisq_ac)


# Convert contingency table to a data frame
data_df <- as.data.frame(as.table(data_for_chisq_ac))

# Create the mosaic plot
ggplot(data_df) +
  geom_mosaic(aes(weight = Freq, x = product(Author), fill = krize_rep)) +
  labs(title = "Mosaic Plot of Contingency Table",
       x = "Author",
       y = "krize_rep",
       fill = "krize_rep") +
  theme_minimal()

# Print results
print(chi_result)
## 
##  Pearson's Chi-squared test
## 
## data:  data_for_chisq_ac
## X-squared = 9.1833, df = 2, p-value = 0.01014

…tak furt nic. Co specializace

dfsknowingcspec <- df %>% 
  filter(krize_rep != "I don't know") %>%
  select(krize_rep, Author, spec_hlavni) %>%
  filter(!is.na(krize_rep)) %>%
  group_by(Author, spec_hlavni) %>%
  mutate(N = n()) %>%
  group_by(Author, spec_hlavni, krize_rep, N) %>%
  summarise(n = n(), .groups = "drop") %>%
  mutate(f = round(n / N * 100, 1)) %>%
  mutate(se = sqrt((f * (100 - f)) / N),  # Standard error
         lower = f - 1.96 * se,           # 95% CI Lower Bound
         upper = f + 1.96 * se)  %>%
  mutate(Author_Spec = paste0(Author,substr(spec_hlavni,1,3)))

reduced_dfs <- dfsknowingcspec %>%
  select(Author_Spec, krize_rep, n) %>%
  ungroup() %>%  # REMOVE GROUPING
  mutate(
    Author_Spec = as.character(Author_Spec),     # Convert factors to characters
    krize_rep = as.character(krize_rep)
  )


# Convert to contingency table
data_for_chisq <- xtabs(n ~ Author_Spec + krize_rep, data = reduced_dfs)

data_df <- as.data.frame(as.table(data_for_chisq))

# Create the mosaic plot
ggplot(data_df) +
  geom_mosaic(aes(weight = Freq, x = product(Author_Spec), fill = krize_rep)) +
  labs(title = "Mosaic Plot of Contingency Table",
       x = "Author_Cont",
       y = "krize_rep",
       fill = "krize_rep") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),  # Rotate X labels
    legend.position = "none"  # Remove legend
  )

Tak tím to taky nebude.

HOW MUCH PUBLISHED WORK IN YOUR FIELD IS REPRODUCIBLE?

Ještě jsem neudělal graf a už tuším problém :( Nikde jsem nenašel dokumentaci, jak Baker rekódovala specializace. V datech je 9 kategorií (nemá matematiku jako my). Abych se s tím nedrbal donekonečna, tak jsem to zatím uděl podle těch 9/10 kategorií, co jsou v datech, pak se když tak poradíme, jak to sloučit.

dfs = df %>% select(rep_podil, spec_hlavni, Author) %>% filter(!is.na(rep_podil)) %>%  
  group_by(Author, spec_hlavni) %>% 
  mutate(N = n()) %>% 
  group_by(Author, rep_podil, spec_hlavni, N) %>% summarise(n = n()) %>% 
  mutate(f = round(n / N * 100, 1), rep_podil = factor(rep_podil))
## `summarise()` has grouped output by 'Author', 'rep_podil', 'spec_hlavni'. You
## can override using the `.groups` argument.
ggplot(dfs, aes(y = rep_podil, x = f, fill = Author, label = paste(f, "%"))) +
  facet_grid(cols = vars(Author), rows = vars(spec_hlavni)) +
  geom_col(alpha = 0.75) +
  geom_text() +
  labs(title = "HOW MUCH PUBLISHED WORK IN YOUR FIELD IS REPRODUCIBLE?", 
       # caption = paste0("Počty platných pozorování: N_Baker = ", dfs[1, "N"], 
       #                  ", N_Jedlička = ", dfs[8, "N"])
       ) +
  # scale_x_log10() +
  guides(fill = "none") +
  theme_minimal()

HOW MUCH PUBLISHED WORK IN YOUR FIELD IS REPRODUCIBLE? (Averages and error bars)

Tak, souhlasím, že vizualizace dle Baker hrozně tříští informaci. Proto zkusíme reprezentovat jen průměry a intervaly spolehlivosti pro tyto průměry.

dfs = df %>% select(rep_podil, spec_hlavni, Author) %>% filter(!is.na(rep_podil)) %>%  
  group_by(Author, spec_hlavni) %>% 
  summarise(mean = mean(rep_podil, na.rm = T), sd = sd(rep_podil, na.rm = T), n = n() - 1,
            se = 1.96 * sd / (n ^ 0.5), l = paste0(round(mean, 0))) %>% 
  ungroup() %>% 
  mutate(spec_hlavni = fct_reorder(.f = spec_hlavni, .x = (mean * (Author == "Baker")))) 
## `summarise()` has grouped output by 'Author'. You can override using the
## `.groups` argument.
ggplot(dfs, aes(y = spec_hlavni, col = spec_hlavni, x = mean, 
                xmin = mean - se, xmax = mean + se, label = l)) +
  facet_grid(cols = vars(Author)) +
  geom_pointrange() +
  geom_errorbarh() +
  guides(color = "none") +
  geom_point(alpha = 0.75, size = 10) +
  geom_text(col = "white", size = 5) +
  labs(title = "HOW MUCH PUBLISHED WORK IN YOUR FIELD IS REPRODUCIBLE?", 
       # caption = paste0("Počty platných pozorování: N_Baker = ", dfs[1, "N"], 
       #                  ", N_Jedlička = ", dfs[8, "N"])
       ) +
  # scale_x_log10() +
  guides(fill = "none") +
  theme_minimal()

Srovnání zastoupení oborů v obou vzorcích

dfs = df %>% select(spec_hlavni, Author) %>% filter(!is.na(spec_hlavni)) %>%  
  group_by(Author) %>% 
  mutate(N = n()) %>% 
  group_by(Author, spec_hlavni, N) %>% summarise(n = n()) %>% 
  mutate(f = round(n / N * 100, 1))
## `summarise()` has grouped output by 'Author', 'spec_hlavni'. You can override
## using the `.groups` argument.
ggplot(dfs, aes(y = fct_rev(spec_hlavni), x = f, fill = Author, label = paste(f, "%"))) +
  geom_col(alpha = 0.75, position = position_dodge2(reverse = T)) +
  geom_text(position = position_dodge2(width = 0.9, reverse = T)) +
  labs(title = "Percentges of fields in both samples", 
       caption = paste0("N_Baker = ", dfs[1, "N"],
                        ", N_Jedlička = ", dfs[18, "N"])
       ) +
  # scale_x_log10() +
  # guides(fill = "none") +
  theme_minimal()

HAVE YOU FAILED TO REPRODUCE AN EXPERIMENT? (Own/Someone’s else)

dfs = df %>% select(starts_with("nerep_"), Author) %>% 
  filter(!is.na(nerep_jaMuj)) %>% filter(!is.na(nerep_jaCizi)) %>% filter(!is.na(nerep_oniMuj)) %>% 
  pivot_longer(cols = starts_with("nerep"), names_to = "Type", names_prefix = "nerep_") %>% 
  mutate(Type = recode(Type, jaCizi = "I failed to\nreproduce others", jaMuj = "I failed to\nreproduce mine",
                       oniMuj = "They failed to\nreproduce mine") %>% factor()) %>% 
  group_by(Author, Type) %>% 
  mutate(N = n()) %>% 
  group_by(Author, Type, value, N) %>% summarise(n = n()) %>% 
  mutate(f = round(n / N * 100, 1))
## `summarise()` has grouped output by 'Author', 'Type', 'value'. You can override
## using the `.groups` argument.
ggplot(dfs, aes(y = fct_rev(value), x = f, fill = Type, label = paste(f, "%"))) +
  facet_grid(cols = vars(Author)) +
  geom_col(alpha = 0.75, position = position_dodge2(reverse = T)) +
  geom_text(position = position_dodge2(width = 0.9, reverse = T)) +
  labs(title = "HAVE YOU FAILED TO REPRODUCE AN EXPERIMENT? (Own/Someone's else)", 
       caption = paste0("N_Baker = ", dfs[1, "N"],
                        ", N_Jedlička = ", dfs[18, "N"])
       ) +
  # scale_x_log10() +
  # guides(fill = "none") +
  theme_minimal()

dfs = df %>% select(starts_with("nerep_"), Author, spec_hlavni) %>% 
  filter(!is.na(nerep_jaMuj)) %>% filter(!is.na(nerep_jaCizi)) %>% filter(!is.na(nerep_oniMuj)) %>% 
  pivot_longer(cols = starts_with("nerep"), names_to = "Type", names_prefix = "nerep_") %>% 
  mutate(Type = recode(Type, jaCizi = "I failed to\nreproduce others", jaMuj = "I failed to\nreproduce mine",
                       oniMuj = "They failed to\nreproduce mine") %>% factor()) %>% 
  group_by(Author, Type, spec_hlavni) %>% 
  mutate(N = n()) %>% 
  group_by(Author, Type, value, N, spec_hlavni) %>% summarise(n = n()) %>% 
  mutate(f = round(n / N * 100, 1)) %>% 
  filter(value == "Yes")
## `summarise()` has grouped output by 'Author', 'Type', 'value', 'N'. You can
## override using the `.groups` argument.
ggplot(dfs, aes(y = fct_rev(spec_hlavni), x = f, fill = Type, label = paste(f, "%"))) +
  facet_grid(cols = vars(Author)) +
  geom_col(alpha = 0.75, position = position_dodge2(reverse = T)) +
  geom_text(position = position_dodge2(width = 0.9, reverse = T)) +
  labs(title = "HAVE YOU FAILED TO REPRODUCE AN EXPERIMENT? (Own/Someone's else)", 
       x = "%",
       # caption = paste0("Počty platných pozorování: N_Baker = ", dfs[1, "N"],
       #                  ", N_Jedlička = ", dfs[18, "N"])
       ) +
  # scale_x_log10() +
  # guides(fill = "none") +
  theme_minimal()

WHAT FACTORS CONTRIBUTE TO IRREPRODUCIBLE RESEARCH?

dfs = df %>% select(starts_with("fakt_"), Author) %>% 
  # filter(!is.na(nerep_jaMuj)) %>% filter(!is.na(nerep_jaCizi)) %>% filter(!is.na(nerep_oniMuj)) %>% 
  pivot_longer(cols = starts_with("fakt_"), names_to = "Typ", names_prefix = "fakt_") %>% 
  filter(!is.na(value)) %>% mutate(Typ = factor(Typ)) %>% group_by(Author, Typ) %>% 
  mutate(N = n()) %>% filter(value != "I don't know") %>% mutate(N_bezNevim = n()) %>% 
  filter(value == "Always" | value == "Very often" | value == "Sometimes") %>%
  mutate(
    value = recode(value, Always = "Always/often", `Very often` = "Always/often"),
    Typ =
      recode(
        Typ, selekc = "Selective reporting", karier = "Pressure to publish",
        statis = "Low statistical power", replik = "Not replicated enough",
        dohled = "Insuficient oversight", dokume = "Methods, code unavailable",
        design = "Poor experimental design", data = "Raw data not available",
        podvod = "Fraud", recenz = "Insufficient peer review", 
        odborn = "Mistakes or inadequate expertise\nin reproduction efforts",
        metody = "Methods need particular\ntechnical expertise", variab = "Variability of reagents", smula = "Bad luck"
        )) %>%
  group_by(Author, Typ, value, N, N_bezNevim) %>% summarise(n = n()) %>%
  mutate(f = round(n / N * 100, 1), 
         X = if_else(value == "Always/often" & Author == "Baker", f, 0) %>% sum(),
         f_bezNevim = round(n / N_bezNevim * 100, 1)
         ) %>% ungroup() %>% group_by(Author, Typ, N, N_bezNevim) %>%
  mutate(X = sum(X)) %>% ungroup() %>% 
  mutate(Typ = fct_reorder(.f = Typ, .x = X)) %>% rename("Answer" = value)
## `summarise()` has grouped output by 'Author', 'Typ', 'value', 'N'. You can
## override using the `.groups` argument.
ggplot(dfs, aes(y = Typ, x = f, fill = Answer, label = paste0(f, "%"))) +
  facet_grid(cols = vars(Author)) +
  geom_col(alpha = 0.75, position = position_stack(reverse = T)) +
  geom_text(position = position_stack(reverse = T, vjust = 0.5), size = 2) +
  labs(title = "WHAT FACTORS CONTRIBUTE TO IRREPRODUCIBLE RESEARCH?\n(including 'I don't know')", 
       x = "%") +
  theme_minimal()

ggplot(dfs, aes(y = Typ, x = f_bezNevim, fill = Answer, label = paste0(f_bezNevim, "%"))) +
  facet_grid(cols = vars(Author)) +
  geom_col(alpha = 0.75, position = position_stack(reverse = T)) +
  geom_text(position = position_stack(reverse = T, vjust = 0.5), size = 2) +
  labs(title = "WHAT FACTORS CONTRIBUTE TO IRREPRODUCIBLE RESEARCH?\n(without 'I don't know')", 
       x = "%") +
  theme_minimal()

dfs = df %>% select(starts_with("fakt_"), Author) %>% 
  # filter(!is.na(nerep_jaMuj)) %>% filter(!is.na(nerep_jaCizi)) %>% filter(!is.na(nerep_oniMuj)) %>% 
  pivot_longer(cols = starts_with("fakt_"), names_to = "Typ", names_prefix = "fakt_") %>% 
  filter(!is.na(value)) %>% mutate(Typ = factor(Typ)) %>% group_by(Author, Typ) %>% 
  mutate(N = n()) %>% filter(value != "I don't know") %>% mutate(N_bezNevim = n()) %>% 
  filter(value == "Always" | value == "Very often" | value == "Sometimes") %>%
  mutate(
    Typ =
      recode(
        Typ, selekc = "Selective reporting", karier = "Pressure to publish",
        statis = "Low statistical power", replik = "Not replicated enough",
        dohled = "Insuficient oversight", dokume = "Methods, code unavailable",
        design = "Poor experimental design", data = "Raw data not available",
        podvod = "Fraud", recenz = "Insufficient peer review", 
        odborn = "Mistakes or inadequate expertise\nin reproduction efforts",
        metody = "Methods need particular\ntechnical expertise", variab = "Variability of reagents", smula = "Bad luck"
        )) %>%
  group_by(Author, Typ, value, N, N_bezNevim) %>% summarise(n = n()) %>%
  mutate(f = round(n / N * 100, 1), 
         X = if_else((value == "Always"  | value == "Very often") & Author == "Baker", f, 0) %>% sum(),
         f_bezNevim = round(n / N_bezNevim * 100, 1)
         ) %>% ungroup() %>% group_by(Author, Typ, N, N_bezNevim) %>%
  mutate(X = sum(X)) %>% ungroup() %>% 
  mutate(Typ = fct_reorder(.f = Typ, .x = X)) %>% rename("Answer" = value)
## `summarise()` has grouped output by 'Author', 'Typ', 'value', 'N'. You can
## override using the `.groups` argument.
ggplot(dfs, aes(y = Typ, x = f, fill = Answer, label = paste0(f, "%"))) +
  facet_grid(cols = vars(Author)) +
  geom_col(alpha = 0.75, position = position_stack(reverse = T)) +
  geom_text(position = position_stack(reverse = T, vjust = 0.5), size = 2) +
  labs(title = "WHAT FACTORS CONTRIBUTE TO IRREPRODUCIBLE RESEARCH?\n(including 'I don't know')", 
       x = "%") +
  theme_minimal()

ggplot(dfs, aes(y = Typ, x = f_bezNevim, fill = Answer, label = paste0(f_bezNevim, "%"))) +
  facet_grid(cols = vars(Author)) +
  geom_col(alpha = 0.75, position = position_stack(reverse = T)) +
  geom_text(position = position_stack(reverse = T, vjust = 0.5), size = 2) +
  labs(title = "WHAT FACTORS CONTRIBUTE TO IRREPRODUCIBLE RESEARCH?\n(without 'I don't know')", 
       x = "%") +
  theme_minimal()

Tady jsem nereprodukoval graf od Baker přesně. Dal jsem si sice záležet, abych kategorie faktorů uspořádal podle součtu odpovědí v kategoriích ‘Vždy’ a ‘Velmi často’, a tak se dalo ověřit, jestli to dělám dobře a seřadím kategorie stejně. Taky kvůli tomu, abys líp našel jednotlivé faktory v těch mých zkratkách. Ale hlavně proto, abychom se mohli podívat, jak má často ona kategorii ‘Vždy’ a jak ‘Velmi často’, tedy, aby bylo jasné, v jakém poměru se ta její sdružená kategorie skládá. A tady je to zajímavé!

Je jasně vidět, že u Baker jsou respondenti víc ultimátní, mnohem častěji volí “Vždy”, než zvolili Češi v našem výzkumu! Češi se jedině rozšoupnou u faktoru ‘karier’, což je Pressure to publish, tam dají 12.5%, ale bacha, stejně nebo víc použijí ve výzkumu Baker tuhle kategorii ‘Vždy’ zahraniční vědci u 7 faktorů ze 14! Ale ‘karier’ vybočuje u Čechů celkově – v součtu dvou nejintenzivnějších kategorií faktor tlaku na publikování a kariéru označují dokonce častěji než zahraniční u Baker, v Čechách je to jediný hojně uváděný typ důvodu, ještě ‘selekc’, tj. Selektivní reportování se přehoupne přes 50 %, ale jinak jsou všechny faktory pod 40 % a vždy méně než Baker, ta má mimochodem pod 40 % jen 5 důvodů ze 14, my 12 ze 14.

Přemýšlím, čím to může být? Napadá mi, že (1) Češi jsou větší idealisti, nebo (2) tím, že dotazujeme ty nejlepší instituce, tak jde o ego-defense strategy, tedy nepřiznají ani sobě, jak často se ta špína děje, nebo (3) měla Baker ve vzorku i nějaká méně prestižní pracoviště, kde se s tím prostě častěji ti vědci setkali, že se tam fixlovalo, a nebo (4) Češi chápou jinak otázku než zahraniční vědci – zatím co Čech to chápe “Když se někde najde sfixlovaný výsledek, je za tím vždy faktor X” a protože to berou tak, že k fixlování vedou různé důvody, spíš sáhnou po ‘Velmi často’ nebo ‘Někdy’ jelikož jim přijde absurdní, že za všemi fixlováními by byla kariéra, podvod atd. Naproti tomu cizinci to pochopili “Když se dostane ke slovu faktor X, jak často způsobí fixlování?” Řekl bych, že Baker určitě má na mysli ten druhý mód: jak často X vede k nekalostem. Jak bychom ověřili, že to tak chápou i Češi, resp. vyloučili, že to nepochopili jako: když se objeví nekalost, může za to X.

„Less Scientifically Developed Culture“ effect (LSDC Effect)

Kde je nejvíc nereprodukovatelného výzkumu?

dfs = df %>% 
  pivot_longer(cols = starts_with("reg_")) %>% select(Author:value) %>% 
  filter(!is.na(value)) %>% 
  mutate(
    value = 
      recode(
        value, Asie = "Asia", Evropa = "Europe", `Severní Amerika` = "North America",
        `Ve všech regionech je podíl nereprodukovatelných studií stejný` =
          "In all regions, the proportion of\nnon-reproducible studies is the same.",
        `Nedokážu posoudit` = "I cannot tell"
      )
  ) %>% mutate(N = n(), value = factor(value) %>% fct_infreq() %>% fct_rev()) %>% 
  group_by(value, N) %>% summarise(n = n()) %>% mutate(f = round(n / N * 100, 1))
## `summarise()` has grouped output by 'value'. You can override using the
## `.groups` argument.
ggplot(dfs, aes(y = value, x = f, fill = value, label = paste0(f, "%"))) +
  geom_col() +
  geom_text(position = position_stack(vjust = 0.8)) +
  guides(fill = "none") + 
  labs(title = "In which region(s) do I think there is the largest share of\nnon-reproducible studies?",
       y = "", x = "") +
  theme_minimal()

Ve které zemi nebo zemích je podle mě nejvyšší podíl nereprodukovatelného výzkumu?

# Co potřebuju?
# 1) rozsekat string na jednotlivá slova a reshapovat
# 2) překódovat víceslovné názvy na jednoslovné a opravit překlepy
# 3) udělat faktor podle frekvence
msx = str_split_fixed(df$nerep_zeme, pattern = boundary("word"), n = 164) %>% as.data.frame() %>%
  rowid_to_column() %>% 
  pivot_longer(cols = V1:V164) %>% filter(value != "") %>% 
  mutate(
    value = 
      recode(
        value, 
        Afriky = "Afrika", Americe = "Amerika", amerika = "Amerika", USA = "Amerika",
        Arabské = "Arábie", Arabi = "Arábie", arabske = "Arábie", Saudi = "Arábie", Blízký = "Arábie", 
        asie = "Asie", Asii = "Asie", asijských = "Asie",
        Cina = "Čína", Ćína = "Čína", Čina = "Čína", čína = "Čína", ČÍNA = "Čína", Čínská = "Čína", china  = "Čína",
        China = "Čína", Číně = "Čína", čínská = "Čína", čínských = "Čína", čínským = "Čína", CN = "Čína", 
        Evropě = "Evropa",
        India = "Indie", indie = "Indie", INDIE = "Indie", 
        Iran = "Írán", Irán = "Írán", Italie = "Itálie",
        Pakiskan = "Pákistán", Pakistán = "Pákistán", 
        SSSR = "Rusko", ruských = "Rusko", russia = "Rusko",
        Ale = "ale", Fabulované = "fab", JV = "jv", Kde = "kde",  # Překódování slov, které nejsou země, aby zmizely.
        Mimochodem = "mmchd", Momentálně = "mmnt", Nedokážu = "nedokážu", Nejde = "nejde", Nelze = "nejde",
        Nejvíce = "nej", Neda = "ne", Nemám = "ne", Nemohu = "ne", Neřekla = "ne", Obávám = "obv", Obecně = "obc",
        Odhadovala = "odh", Podíl = "pod", Prosím = "pls", Rozvíjející = "roz", Saudská = "saud",
        Severní = "sev", Spíš = "spíš", Tam = "tam", To = "to", Toto = "to", V = "v", Velká = "vel",
        Všude = "all", Východ = "vých", Bývalé = "býv", Bývalý = "býv"
        )) %>% 
  filter(str_detect(value, pattern = '[:upper:]')) %>% filter(value != "Nevím") %>% 
  mutate(
    value = 
      recode(
        value, Čína = "China", Indie = "India", Amerika = "USA",
        Rusko = "Russia", Írán = "Iran", Asie = "Asia", Pákistán = "Pakistan",
        Arábie = "Arabic countries", Japonsko = "Japan", Afrika = "Africa", Turecko = "Turkey"),
    U = unique(rowid) %>% length(), 
    value = as_factor(value) %>% fct_infreq() %>% fct_rev(), 
    N = n()) %>% 
  group_by(value, N, U) %>% summarise(n = n()) %>% filter(n > 2) %>% 
  mutate(f = round(n / nrow(my) * 100, 1))  
## `summarise()` has grouped output by 'value', 'N'. You can override using the
## `.groups` argument.
  # 'nrow(my)' is size of sample, 
  # 'U' is number of respondents giving at least one answer,
  # 'N' is number of mentions

ggplot(msx, aes(y = value, x = f, label = paste0(f, "%"), fill = value)) +
  geom_col(width = 0.8) +
  geom_text(position = position_stack(vjust = 0.5)) +
  guides(fill = "none") +
  labs(title = "Which country or countries do you think have the highest\nproportion of non-reproducible research?",
       y = "Country", x = "", 
       caption = paste0("Note: Respondents might mention more than one country.\nWe ask ", nrow(my), 
                        " respondents to mention countries, from ", msx[1, "U"], " respondents we received ", 
                        msx[1, "N"], " mentions."),
       subtitle = paste0("(N=", nrow(my), ")")) +
  theme_minimal()

Zaznamenal(a) jsem ze strany zahraničních vědců ve svém oboru despekt k českým vědcům nebo k české vědě (kvůli nižším standardům vědecké práce apod.)?

dfs = df %>%  filter(!is.na(despekt)) %>% 
  mutate(N = n()) %>% group_by(despekt, N) %>% summarise(n = n()) %>% 
  mutate(f = round(n / N * 100, 1)) 
## `summarise()` has grouped output by 'despekt'. You can override using the
## `.groups` argument.
ggplot(dfs, aes(x = despekt, y = f, label = paste0(f, "%"), fill = despekt)) +
  geom_col(width = 0.7) +
  geom_text(position = position_stack(vjust = 0.5)) +
  labs(title = "I have noticed a disdain from foreign scientists in my field for Czech scientists\nor for Czech science (because of lower standards of scientific work, etc.)?", x = "", y = "%") +
  guides(fill = "none") +
  theme_minimal()

Které země?

Bude doplněno.

Pokud jsem zaznamenal(a) tento despekt:

dfs = df %>% select(starts_with("desp_"), -desp_zeme) %>% rowid_to_column() %>% 
  pivot_longer(cols = starts_with("desp_"), names_prefix = "desp_") %>% 
  filter(!is.na(value)) %>% 
  mutate(
    name = 
      recode(
        name, nevim = "I cannot tell.", jazyk = "It was a linguistic issue,\nas I am not a native speaker.",
        kvalita = "It was related to the quality\nof the scientific work.", vlastni = "Other"
      ) %>% factor() %>% fct_infreq() %>% fct_rev(),
    U = unique(rowid) %>% length(), N = n()
  ) %>% group_by(U, N, name) %>% summarise(n = n()) %>% 
  mutate(f = round(n / U * 100, 1))
## `summarise()` has grouped output by 'U', 'N'. You can override using the
## `.groups` argument.
ggplot(dfs, aes(y = name, x = f, fill = name, label = paste0(f, "%"))) +
  geom_col() +
  geom_text(position = position_stack(vjust = 0.8)) +
  guides(fill = "none") + 
  labs(title = paste0("If I have noted this disrespect:\n(N=", dfs[1, "U"], ")"),
       caption = paste0("Note: Respondents might chose up to three answers, ", dfs[1, "U"],
                        " respondents gave at least one answer."),
       y = "", x = "%") +
  theme_minimal()

Jak reaguji, když zjistím, že vědecké studie z určitých regionů vykazují ve větší míře vady nebo jsou nereprodukovatelné?

dfs = df %>% select(starts_with("reakce_")) %>% rowid_to_column() %>% 
  pivot_longer(cols = starts_with("reakce_"), names_prefix = "reakce_") %>% 
  filter(!is.na(value)) %>% 
  mutate(
    name = 
      recode(
        name, nectu = "I stop reading studies\nfrom that region altogether.",
        vyberu = "I only choose labs that I believe\nare producing high quality science.",
        obezret = "I continue to read studies from this region,\nbut I am cautious about their conclusions.", 
        nic = "Not at all, because even studies from this\nregion can contain valuable results.", 
        stejne = "In my field, studies from all\nregions are of equal quality.",
        nevim = "I cannot tell.", vlastni = "Other"
      ) %>% factor() %>% fct_infreq() %>% fct_relevel("Other", after = 7) %>% 
      fct_relevel("I cannot tell.", after = 7) %>% fct_rev(),
    U = unique(rowid) %>% length(), N = n()
  ) %>% group_by(U, N, name) %>% summarise(n = n()) %>% 
  mutate(f = round(n / U * 100, 1))
## `summarise()` has grouped output by 'U', 'N'. You can override using the
## `.groups` argument.
ggplot(dfs, aes(y = name, x = f, fill = name, label = paste0(f, "%"))) +
  geom_col() +
  geom_text(position = position_stack(vjust = 0.8)) +
  guides(fill = "none") + 
  labs(title = paste0("How do I react when I find that scientific studies from\ncertain regions are more likely to have flaws or\nare non-reproducible?\n(N=", dfs[1, "U"], ")"),
       caption = paste0("Note: Respondents might chose up to three answers, ", dfs[1, "U"],
                        " respondents gave at least one answer."),
       y = "", x = "%") +
  theme_minimal()

Co poškozuje objektivitu? Jak často se to objevuje a jak velká hrozba to je?

Uděláme si na to pěkný graf! Každé položce spočítáme dva údaje:

  1. Kolik procent lidí označilo položku jako hrozbu
  2. Kolik je průměrná frekvence výskytu položky
dfs = df %>% select(starts_with("frek_"), starts_with("neobj_")) %>% 
  pivot_longer(cols = everything()) %>% filter(!is.na(value)) %>% 
  group_by(name) %>% summarise(f = round(mean(value, na.rm = T), 1)) %>% 
  separate(name, sep = "_", into = c("dim", "item")) %>% 
  pivot_wider(id_cols = item, names_from = dim, values_from = f) %>% 
  mutate(
    item = 
      recode(
        item, 
        salam = "Using the 'salami method', where the results are published in multiple papers (possibly duplicated).",
        popular = "Insertion of 'popular' terms (e.g. in texts or articles) for the sole purpose of improving publishability.",
        positiv = "Publishing only positive or statistically relevant results.",
        nerepre = "Unjustified generalization of results where data are interpreted in too broad a context (insufficient large sample or unrepresentative sample).",
        hacking = "P-hacking and similar strategies (data-dredging, significance-chasing, harking).",
        zbytecne = "Publication of unimportant studies whose scientific contribution is questionable.",
        oversell = "Deliberate exaggeration of scientific results ('overselling').",
        modmet = "Use of 'fashionable' methods in research without scientific justification (e.g. in grant applications).")
  )

ggplot(dfs, 
       aes(x = frek, y = neobj, col = item, 
           label = str_wrap(paste0(item, ": X = ", frek, "%, Y = ", neobj, "%."), 30))) +
  geom_label_repel(size = 3) +
  geom_point(size = 5, alpha = 0.3) +
  guides(col = "none") +
  labs(title = "How often do I encounter the following phenomena in my field? VS.\nDo I think the following phenomena distort the objectivity of science?",
       y = "Do I think the following phenomena distort the objectivity of science? (%)",
       x = "How often do I encounter the following phenomena in my field? (%)") +
  theme_minimal()

Demographics

Gender:

dfs = df %>%  filter(!is.na(gender)) %>% 
  mutate(gender = recode(gender, Muž = "Male", Žena = "Female", Ostatní = "Other") %>% 
           factor() %>% fct_infreq(),
         N = n()) %>% group_by(gender, N) %>% summarise(n = n()) %>% 
  mutate(f = round(n / N * 100, 1)) 
## `summarise()` has grouped output by 'gender'. You can override using the
## `.groups` argument.
ggplot(dfs, aes(x = gender, y = f, label = paste0(f, "%"), fill = gender)) +
  geom_col(width = 0.7) +
  geom_text(position = position_stack(vjust = 0.5)) +
  labs(title = paste0("I am:\n(N=", dfs$N[1],")"), x = "Gender", y = "%") +
  guides(fill = "none") +
  theme_minimal()

My age is:

dfs = df %>%  filter(!is.na(vek)) %>% 
  mutate(vek = recode(vek, `65 a více` = "65+"),
         N = n()) %>% group_by(vek, N) %>% summarise(n = n()) %>% 
  mutate(f = round(n / N * 100, 1)) 
## `summarise()` has grouped output by 'vek'. You can override using the `.groups`
## argument.
ggplot(dfs, aes(x = vek, y = f, label = paste0(f, "%"), fill = vek)) +
  geom_col(width = 0.7) +
  geom_text(position = position_stack(vjust = 0.5)) +
  labs(title = paste0("My age is:\n(N=", dfs$N[1],")"), x = "Gender", y = "%") +
  guides(fill = "none") +
  theme_minimal()

How many years do I work in research:

dfs = df %>% filter(!is.na(kar_praxe)) %>% 
  mutate(
    kar_praxe = 
      recode(kar_praxe, `5–9 let` = "5–9 years", `10–19 let` = "10–19 years",
             `20 a více let` = "20+ years", `Méně než 5 let` = "<5 years") %>% 
      factor(levels = c("<5 years", "5–9 years", "10–19 years", "20+ years")),
    N = n()) %>% 
  group_by(kar_praxe, N) %>% summarise(n = n()) %>% mutate(f = round(n / N * 100, 1))
## `summarise()` has grouped output by 'kar_praxe'. You can override using the
## `.groups` argument.
ggplot(dfs, aes(x = kar_praxe, y = f, label = paste0(f, "%"), fill = kar_praxe)) +
  geom_col(width = 0.7) +
  geom_text(position = position_stack(vjust = 0.5)) +
  labs(title = "How many years do I work in research:", x = "", y = "") +
  guides(fill = "none") +
  theme_minimal()

Jaká je moje nejvyšší dosažená akademická hodnost:

Má to cenu překládat? Zájímá to mimo ČR vůbec někoho?

df %>% filter(!is.na(kar_titul)) %>% 
  mutate(
    kar_titul = fct_relevel(kar_titul, "Mgr./Ing.", after = 1),
    N = n()) %>% 
  group_by(kar_titul, N) %>% summarise(n = n()) %>% mutate(f = round(n / N * 100, 1)) %>%
  ggplot(aes(x = kar_titul, y = f, label = paste0(f, "%"), fill = kar_titul)) +
  geom_col(width = 0.7) +
  geom_text(position = position_stack(vjust = 0.5)) +
  labs(title = "My the best academic rank is:", x = "", y = "") +
  guides(fill = "none") +
  theme_minimal()
## `summarise()` has grouped output by 'kar_titul'. You can override using the
## `.groups` argument.

Tady mi jenom napadá – nevyhodíme ty Bc. respondenty? Jsou to asi 4 lidi, nebo 5 a zvýší to snad důvěryhodnost výsledků. U magistrů, inženýrů bych hodně váhal, to je 130 lidí.

Foo hoo! Tak jinak! Baker má data od spousty lidí, co jsou “jen” Ph.D. studenti, takže Mgr./Ing. rozhodně nechat, Baker jich má taky hafo: PhD Students má 410 (26.02%), Post-doctoral Fellows má 314 (19.92%), dohromady tyto dvě skupiny tvoří 46 % vzorku! To je taky možná důvod těch rozdílů, že těchto “relativních juniorů” je v tom jejím výzkumu mnohem víc než v tom našem.

Počet citací (odhad):

dfs = df %>% filter(!is.na(kar_citace)) %>% 
  mutate(
    kar_citace = 
      recode(kar_citace, `Do 100` = "0–100", `Do 500` = "101–500",
             `Do 1000` = "501–1000", `Do 5000` = "1001–5000", Více = "5000+") %>% 
      factor(levels = c("0–100", "101–500", "501–1000", "1001–5000", "5000+")),
    N = n()) %>% 
  group_by(kar_citace, N) %>% summarise(n = n()) %>% mutate(f = round(n / N * 100, 1))
## `summarise()` has grouped output by 'kar_citace'. You can override using the
## `.groups` argument.
ggplot(dfs, aes(x = kar_citace, y = f, label = paste0(f, "%"), fill = kar_citace)) +
  geom_col(width = 0.7) +
  geom_text(position = position_stack(vjust = 0.5)) +
  labs(title = "Number of citations:", x = "Citations", y = "") +
  guides(fill = "none") +
  theme_minimal()

Publikoval jsem (i jako spoluautor) v nejlepším oborovém časopisu:

dfs = df %>% filter(!is.na(kar_nejlepsi)) %>% 
  mutate(
    kar_nejlepsi = 
      recode(kar_nejlepsi, Ano = "Yes", Ne = "No") %>% 
      factor(levels = c("Yes", "No")),
    N = n()) %>% 
  group_by(kar_nejlepsi, N) %>% summarise(n = n()) %>% mutate(f = round(n / N * 100, 1))
## `summarise()` has grouped output by 'kar_nejlepsi'. You can override using the
## `.groups` argument.
ggplot(dfs, aes(x = kar_nejlepsi, y = f, label = paste0(f, "%"), fill = kar_nejlepsi)) +
  geom_col(width = 0.7) +
  geom_text(position = position_stack(vjust = 0.5)) +
  labs(title = "I got published (as co-author, as well) in the best journal in the field: ", x = "", y = "") +
  guides(fill = "none") +
  theme_minimal()

Do jaké kategorie dle klasifikace Akademie věd patřím:

Asi jen pro nás, proto to zatím nepřekládám, jestli se Ti to bude na něco hodit, dej vědět. Taky tu nechávám absolutní počty. Aby to bylo použitelné, budeme to muset sjednotit napříč institucemi.

df %>% filter(!is.na(kar_kategorie)) %>% 
  mutate(
    kar_kategorie = fct_relevel(kar_kategorie, "Doktorand do třídy V2", after = 2)  # Pozor! Pozice této kategorie zůstane v datech stejná!
  ) %>%
  ggplot(aes(y = kar_kategorie , fill = kar_kategorie)) +
  geom_bar(width = 0.7) +
  labs(title = "", x = "Počty respondentů", y = "Kategorie") +
  guides(fill = "none") +
  theme_minimal()

Dodatečný graf

Tak a tady si uděláme 3 verze jednoho grafu, neb Baker se ptá na 4 výroky ohledně neroprodukovatelnosti a z nich se my ptáme na 3 a ještě na další 3, takže slušný mr*ník :-)

dfs = df %>% select(Author, starts_with("nerep_"), -c(nerep_zeme, nerep_jaMuj:nerep_jaCizi)) %>% 
  pivot_longer(cols = starts_with("nerep_"), names_prefix = "nerep_") %>% 
  filter(!is.na(value)) %>% 
  mutate(
    name = 
      recode(
        name, problem = "I think that the failure to reproduce scientific studies is a major problem in my field.",
        problemAll = "I think that the failure to reproduce scientific studies is a major problem for all fields.",
        chyba = "I think that a failure to reproduce a result most often means that the original finding is wrong.",
        valid = "Nereproducibility detracts the validity of the original finding.",
        neobj = "I think that a failure to reproduce rarely detracts from the validity of the original finding.",
        veda = "Also irreproducible results are the part of the science.",
        narust = "The fraction of irreproducibile findings has increased during my carrier."
        ),
    value = factor(value, levels = c("Strongly agree", "Agree", "Neither agree nor disagree", "Disagree", 
                                     "Strongly disagree", "I don't know"))
    ) %>% group_by(Author, name) %>% 
  mutate(N = n()) %>% group_by(Author, name, N, value) %>% 
  summarise(n = n()) %>% mutate(f = round(n / N * 100, 1))
## `summarise()` has grouped output by 'Author', 'name', 'N'. You can override
## using the `.groups` argument.
dfs %>% 
  ggplot(aes(y = str_wrap(name, 50), x = f, label = paste0(f, "%"), fill = value)) +
  facet_grid(cols = vars(Author)) +
  geom_col(position = position_stack(reverse = T), alpha = 0.8) +
  geom_label_repel(position = position_stack(vjust = 0.5, reverse = T), alpha = 0.75) +
  labs(y = "Sentences", x = "%", title = "Agreement with selected sentences") +
  # guides(label = "none") +
  theme_minimal() +
  theme(legend.position = "bottom")

dfs %>% 
  filter(name == "I think that the failure to reproduce scientific studies is a major problem in my field." |
        name == "I think that the failure to reproduce scientific studies is a major problem for all fields." |
        name == "I think that a failure to reproduce a result most often means that the original finding is wrong." |
        name == "I think that a failure to reproduce rarely detracts from the validity of the original finding.") %>% 
  ggplot(aes(y = str_wrap(name, 50), x = f, label = paste0(f, "%"), fill = value)) +
  facet_grid(cols = vars(Author)) +
  geom_col(position = position_stack(reverse = T), alpha = 0.8) +
  geom_label_repel(position = position_stack(vjust = 0.5, reverse = T), alpha = 0.75) +
  labs(y = "Sentences", x = "%", title = "Agreement with selected sentences") +
  # guides(label = "none") +
  theme_minimal() +
  theme(legend.position = "bottom")

dfs %>% 
  filter(name == "I think that the failure to reproduce scientific studies is a major problem in my field." |
        name == "I think that a failure to reproduce a result most often means that the original finding is wrong." |
        name == "I think that a failure to reproduce rarely detracts from the validity of the original finding.") %>% 
  ggplot(aes(y = str_wrap(name, 50), x = f, label = paste0(f, "%"), fill = value)) +
  facet_grid(cols = vars(Author)) +
  geom_col(position = position_stack(reverse = T), alpha = 0.8) +
  geom_label_repel(position = position_stack(vjust = 0.5, reverse = T), alpha = 0.75) +
  labs(y = "Sentences", x = "%", title = "Agreement with selected sentences") +
  # guides(label = "none") +
  theme_minimal() +
  theme(legend.position = "bottom")

dfsx = df %>% select(Author, starts_with("nerep_"), -c(nerep_zeme, nerep_jaMuj:nerep_jaCizi)) %>% 
  pivot_longer(cols = starts_with("nerep_"), names_prefix = "nerep_") %>% 
  filter(!is.na(value), value != "I don't know") %>% 
  mutate(
    name = 
      recode(
        name, problem = "I think that the failure to reproduce scientific studies is a major problem in my field.",
        problemAll = "I think that the failure to reproduce scientific studies is a major problem for all fields.",
        chyba = "I think that a failure to reproduce a result most often means that the original finding is wrong.",
        valid = "Nereproducibility detracts the validity of the original finding.",
        neobj = "I think that a failure to reproduce rarely detracts from the validity of the original finding.",
        veda = "Also irreproducible results are the part of the science.",
        narust = "The fraction of irreproducibile findings has increased during my carrier."
        ),
    value = factor(value, levels = c("Strongly agree", "Agree", "Neither agree nor disagree", "Disagree", 
                                     "Strongly disagree", "I don't know"))
    ) %>% group_by(Author, name) %>% 
  mutate(N = n()) %>% group_by(Author, name, N, value) %>% 
  summarise(n = n()) %>% mutate(f = round(n / N * 100, 1))
## `summarise()` has grouped output by 'Author', 'name', 'N'. You can override
## using the `.groups` argument.
dfsx %>% 
  filter(name == "I think that the failure to reproduce scientific studies is a major problem in my field." |
        name == "I think that a failure to reproduce a result most often means that the original finding is wrong." |
        name == "I think that a failure to reproduce rarely detracts from the validity of the original finding.") %>% 
  ggplot(aes(y = str_wrap(name, 50), x = f, label = paste0(f, "%"), fill = value)) +
  facet_grid(cols = vars(Author)) +
  geom_col(position = position_stack(reverse = T), alpha = 0.8) +
  geom_label_repel(position = position_stack(vjust = 0.5, reverse = T), alpha = 0.75) +
  labs(y = "Sentences", x = "%", title = "Agreement with selected sentences") +
  # guides(label = "none") +
  theme_minimal() +
  theme(legend.position = "bottom")